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Abstract 

Chow and Liu [2] introduced an algorithm for fitting a multivariate distribution with a tree (i.e. a density 
model that assumes that there are only pairwise dependencies between variables) and that the graph of 
these dependencies is a spanning tree. The original algorithm is quadratic in the dimesion of the domain, 
and linear in the number of data points that define the target distribution P. This paper shows that for 
sparse, discrete data, fitting a tree distribution can be done in time and memory that is jointly subquadratic 
in the number of variables and the size of the data set. The new algorithm, called the acCL algorithm, 
takes advantage of the sparsity of the data to accelerate the computation of pairwise marginals and the 
sorting of the resulting mutual informations, achieving speed ups of up to 2-3 orders of magnitude in the 
experiments. 
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1 Introduction 

Chow and Liu [2] introduced an algorithm for fitting a 
multivariate distribution with a tree, i. e. a density 
model that assumes that there are only pairwise depen- 
dencies between variables and that the graph of these 
dependencies is a spanning tree. Recently, the interest 
in this algorithm, hitherto called the CL algorithm, has 
been revived by the introduction of several probabilistic 
models that extend the spanning tree model. The mix- 
ture of trees of [8, 9] is a mixture model in which each 
component is a spanning tree with a possibly different 
topology over the set of variables. The TANB (Tree Aug- 
mented Naive Bayes) model of [5] is a mixture of trees 
designed for classification. Each class density is modeled 
by a tree; these trees are combined in a mixture where 
the mixing proportions represent the prior probabilities 
of the class variable. The name of the classifier reflects 
the fact that it is an extension of the Naive Bayes classi- 
fier of [1]. The CL algorithm as well as its extensions for 
the mixture of trees and for the TANB model can handle 
forests (acyclic graphs that are disconnected) but for the 
sake of clarity throughout this paper I will assume that 
all the trees that are learned are spanning trees. Also, as 
in the original work of Chow and Liu, we consider that 
all the variables are discrete, with finite range. 

In the framework of graphical probability models tree 
distribution enjoy many properties that make them at- 
tractive as modeling tools: they are intuitively appeal- 
ing, have low complexity and yet a flexible topology, 
sampling and computing likelihoods for trees are lin- 
ear time, efficient and simple algorithms for marginal- 
izing and conditioning exist. Mixtures of trees enjoy all 
the computational advantages of trees and, in addition, 
they are universal approximators over the the space of 
all distributions 1 . 

Moreover, trees are one of the few classes of graphical 
models for which structure search is tractable [6]. The 
CL algorithm finds the structure and parameters of the 
tree T that best fits a given distribuition P as measured 
by the Kullback-Liebler (KL) divergence 



KL{P\\T) 



£p ( aOlog£M 



(1) 



For this purpose, the algorithm uses the mutual in- 
formation I uv under P between each pair of variables 
u, v in the domain. When P is an empirical distribu- 
tion obtained from and i.i.d. set of data, the computa- 
tion of all the mutual information values requires time 
and memory quadratic in the number of variables n and 
linear in the size of the dataset N. Then a Maximum 
Weight Spanning Tree (MWST) algorithm [3] using the 
computed mutual information as edge weights finds the 
optimal tree structure. Finally, the parameters of the 
distribution are assigned by copying the values of the 



1 over discrete domains 



marginals of P corresponding to each edge of the tree. 
The most computationally expensive step in fitting a tree 
to data is the first step - computing the pairwise mutual 
informations in the empirical distribution. The time and 
memory requirements of this step are acceptable for cer- 
tain problems but they may become prohibitive when 
the dimensionality n of the domain becomes large. 

An example of such a domain is information retrieval. 
In information retrieval, the data points are documents 
from a data base and the the variables are words from 
a vocabulary. A common representation for a document 
is as a binary vector whose dimension is equal to the 
vocabulary size. The vector component corresponding 
to a word v is 1 if v appears in the document and 
otherwise. The number N of documents in the data base 
is of the order of 10 3 - 10 4 . Vocabulary sizes too can be 
in the thousands or tens of thousands. This means that 
fitting a tree to the data necessitates n 2 ~ 10 6 — 10 9 
mutual information computations and n 2 N ~ 10 9 — 
10 12 counting operations, a number much too large for 
many of todays applications. 

The present work shows how to improve on the time 
and memory requirements of the CL algorithm in or- 
der to allow the use of tree density models for higher 
dimensionality domains. It will also show how this im- 
provement can be carried over to learning mixtures of 
trees or TANB models. 

It is obvious that in general the proportionality with 
the size of the data set N cannot be improved on, since 
one needs to examine each data point at least once. 
Hence the focus will be on improving on the dependency 
on n. Here the situation is as follows: there are several 
MWST algorithms, some more efficient than others, but 
all the algorithms that I know of run in time at least 
proportional to the number of candidate edges. For our 
problem this number is n(n — l)/2 which would result 
in an algorithm that is at least quadratic in the number 
of variables n. But we also have additional information: 
weights are not completely arbitray; each edge uv has a 
weight equal to the mutual information I uv between the 
variables u and v. Moreover, in the information retrieval 
example described above, the domain has a particularity: 
each document contains only a relatively small number 
of words (of the order of 10 2 ) and therefore most of the 
components of its binary vector representation are null. 
We call this property of the data sparsity. Can we use 
these facts to do better? 

The answer is yes. Remark that the only way the 
MWST algorithm uses the edge weights is in compar- 
isons. The idea the present work is based on is to 
compare mutual informations between pairs of variables 
without actually computing them when this is pos- 
sible. This will result in a partial sorting of the edges 
by mutual information. A second idea is to exploit the 
sparsity of the data to speed up the most expensive step 
of the algorithm: computing the marginals for all pairs 



of variables. Combining the two will result in an algo- 
rithm that (under certain assumptions) is jointly sub- 
quadratic in N and n w.r.t. both running time and 
memory. This algorithm, that we call accelerated CL 
algorithm (acCL) will be presented and analyzed in the 
rest of this paper. 

I will start by briefly presenting the CL algorithm and 
the notation that will be used (section 2). The next sec- 
tion, 3, introduces the assumptions underlying the acCL 
algorithm. Section 4 develops the algorithm proper, in 
the special case of binary variables. The generalization 
to variables of arbitrary arity, to non-integer counts (as 
in the case of the EM algorithm) and a discussion of the 
usage of the acCL algorithm in conjunction with priors 
follow in sections 5, 6 and 7 respectively. Finally, exper- 
imental results are presented in section 8. 

2 Fitting a tree to a distribution 

In this section we introduce the tree distribution, the 
tree learning algorithm called the CL algorithm and the 
notation that will be used throughout the paper. For 
more detail on these topics the reader is recommended 
to consult [2, 8, 7). 

Let V denote the set of variables of interest, \V\ = n. 
Let v denote a variable in V, r v its number of values , 
x v a particular value of v and x an n-dimensional vector 
representing an assignment to all the variables in V. 

2.1 Tree distributions 

We use trees as graphical representations for families 
of probability distributions over V that satisfy a com- 
mon set of independence relationships encoded in the 
tree topology. In this representation, an edge of the tree 
shows a direct dependence, or, more precisely, the ab- 
sence of an edge between two variables signifies that they 
are independent, conditioned on all the other variables 
in V. We shall call a graph that has no cycles a tree 2 
and shall denote by E its edge set. 

Now we define a probability distribution T that is con- 
formal with a tree. Let us denote by T uv and T v the 
marginals of T: 



T v (x v ) = V] T(x Vl x v _{ v }). 



Let degu be the degree of vertex v, e.g. the number of 
edges incident to v € V. Then, the distribution T is 
conformal with the tree (V, E) if it can be factorized as: 



T(x) 
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(2) 



2 In the graph theory literature, our definition corresponds 
to a forest. The connected components of a forest are called 
trees. 



The distribution itself will be called a tree when no con- 
fusion is possible. 

2.2 Mixtures of trees 

We define a mixture of trees to be a distribution of the 
form 



Q{x) = ^> fe T fe ( a 



(3) 
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with 



A fe >0, fc = l,...,m; JTAfc = 1. (4) 

fe=i 

The tree distributions T k are the mixture components 
and Afe are called mixture coefficients. A mixture of trees 
can be viewed as a containing an unobserved choice vari- 
able z, taking value k <G {1, . . .to} with probability Afe. 
Conditioned on the value of z the distribution of the vis- 
ible variables x is a tree. The to trees may have different 
structures and different parameters. A mixture of trees 
where all the trees have the same structure is equivalent 
toaTANB model [5]. 

2.3 The CL algorithm 

Let us now turn to learning trees from data. The ob- 



served data are denoted byT>~ {x 1 , 



,.A r 



}, where 



each x l represents a vector of observations for all the vari- 
ables in V. Learning a tree in the Maximum Likelihood 
framework, means finding a tree distribution T opt that 
satisfies 



N 

T o P t = argmax^logT(a; J ) 

2 — 1 



(5) 



This problem is a special case of the more general 
task of fitting a tree to a distribution P by minimizing 
KL(P\\T). The fact that the tree model is factorable 
allows for an efficient and elegant algorithm for solving 
this problem, owed to Chow and Liu [2]. The algorithm 
is briefly described here. 

The algorithm is based on the fact that all the infor- 
mation about a distribution P that is necessary to fit a 
tree is contained in its pairwise marginals P uv . If the dis- 
tribution is an empirical distribution defined by a data 
set T>, computing these marginals from data is a compu- 
tationally expensive step and takes 0{n 2 N) operations. 

Further, to find the tree structure, given by its set 
of edges E, one has to compute the mutual information 
between each pair of variables in V under the target 
distribution P 

Iuv = > , Puv{Xu,X v )lOg— — — — -, U,V£V,UfV. 
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(6) 

Since V has n variables, there are n(n — l)/2 mutual 
informations to be computed. After that, the optimal 
tree structure E is found by a Maximum Weight Span- 
ning Tree (MWST) algorithm using I uv as the weight for 



edge (u,v),\/u,v <G V. Computing the MWST can be 
done in several ways [3, 11, 4, 10]. The one we present 
here is called the Kruskal algorithm [3]. It is essentially 
a greedy algorithm. The candidate edges are sorted in 
decreasing order of their weights (i.e. mutual informa- 
tions). Then, starting with an empty E, the algorithm 
examines one edge at a time (in the order resulting from 
the sort operation), checks if it forms a cycle with the 
edges already in E and, if not, adds it to E. The algo- 
rithm ends when n — 1 edges have been added to E. 

Once the tree is found, its marginals T uv (u, v) E E 
are exactly equal to the corresponding marginals P uv of 
the target distribution P. 



TT = P uv for uveE 



(7) 



They are already computed as an intermediate step in 
the computation of the mutual informations I uv (6). 

3 Assumptions 

This section presents the assumptions that help us de- 
velop the accelerated CL algorithm. Of the four assump- 
tions stated below, the one that is essential is the spar- 
sity assumptions. The first two will be dispensed of or 
relaxed later on and are made here for the sake of sym- 
plifying the presentation. The last is a technical assump- 
tion. 

Binary variables. All variables in V take values in 
the set {0, 1}. When a variables takes value 1 we say 
that it is "on", otherwise we say it is "off". Without 
loss of generality we can further assume that a variable 
is off more times than it is on in the given dataset. 

Integer counts. The target distribution P is derived 
from a set of observations of size N. Hence, 



Pv(l) = ^ = 1-P„(0) 



(8) 



where N v represents the number of times variable v is 
on in the dataset. According to the first assumption, 

< P„(l) < \ or < N v < ^N (9) 

We can also exclude as non-informative all the variables 
that are always on (or always off) thereby ensuring the 
strict positivity of P v (l) and N v . 

Let us denote by N uv the number of times variables 
u and v are simultaneously on. We call each of these 
events a cooccurrence of u and v. The marginal P uv of u 
and v is given by 

N.P UV (1,1) = N uv (10) 

N.P UV (1,Q) = N u -N uv (11) 

N.P UV (0,1) = N v -N uv (12) 

N.P uv (0,Q) = N-N v -N u + N uv (13) 

All the information about P that is necessary for fit- 
ting the tree is summarized in the counts N, N v and 



N uv , u, v = 1, . . . , n that are assumed to be non-negative 
integers. From now on we will consider P to be repre- 
sented by these counts. 

Sparse data. Let us denote by < |x| < n the num- 
ber of variables that are on in observation x. Further, 
define s, the sparsity of the data by 



max \x 

i=l,N 



(14) 



If, for example, the data are documents and the variables 
represent words from a vocabulary, then s represents the 
maximum number of distinct words in a document. The 
time and memory requirements of the accelerated CL 
algorithm that we are going to introduce depend on the 
sparsity s. The lower the sparsity, the more efficient 
the algorithm. From now on, s will be assumed to be a 
constant and 

s « n, N. (15) 

Data/dimension ratio bounded The ratio of the 
number of data points N vs. the dimension of the do- 
main n is bounded. 



N 

— < R 

n 



(16) 



This is a technical assumption that will be useful later. 
It is a plausible assumption for large n and N. 

4 The accelerated CL algorithm 

4.1 First idea: Comparing mutual 

informations between binary variables 

The mutual information between two binary variables 
U, v € V can be expressed as: 



-* nv — 



1 

N 



{-N u log N U -(N- AQ log(N - N u ) + N log N 



- N v log N V -(N-N V ) log(JV - N v ) + Nlog N 

+N UV log N uv + (N u - N uv )log(N u - N uv ) 

+ (N V - N uv )log(N v - N uv ) 

+ (N-N U -N V + N uv ) log(N -N u -N v + N uv ) 

-log TV} (17) 

Let us focus on the pairs u, v that do not cooccur, i.e. 
for which N uv = 0. For such a pair, the above expression 
simplifies to 

Iuv = -^ {-(N - N u )log(N - N u ) 
-(N - N v )log(N - N v ) 
+(N -N u - N v ) log(N -N u - N v ) 
+AlogiV} (18) 

Knowing that N is fixed for the dataset, it follows that 
I uv in the cooccurrence case is function of 2 variables: 
N v and N u . Let us fix u (and consequently 7V„) and 
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variables 




Figure 1: The bipartite graph representation of a sparse 
data set. Each edge iv means "variable v is on in data 
point i" . 



analyse the variation of the mutual information w.r.t. 
N„: 



Oh 



dN v 



log(N -N v )- log(N - N u - N v ) 

N-N v 



log 
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N„ 
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(19) 



It follows that if N uv = then the mutual information 
I uv is monotonically increasing with N v for a fixed u. 
Put in other terms, for a given variable u and any two 
variables v, v' for which N uv — N uv i = 

N v > N v i implies that I uv > I uv > . 

This observation allows us to partially sort the mutual 
informations I uv for non-cooccurring pairs u,v, without 
computing them. First, we have to sort all the variables 
by their nubmber of occurrences N v . This gives an or- 
dering y of all the variables in V. 



v y u & N v > N u 



(20) 



Then, for each u, we create the list of variable following 
u and not cooccurring with it: 



V (u) = {v e V, v y u, N uv = 0} 



(21) 



This list is sorted by decreasing N v and therefore, im- 
plicitly, by decreasing I uv . Since the data are sparse, 
most pairs of variables do not cooccur. Therefore, by 
creating the lists Vq(u), a large number of the mutual 
informations are partially sorted. We shall show how to 
use this construction in section 4. Before that, let us 
examine an efficient way of computing the N uv counts 
when the data are sparse. 

4.2 Second idea: computing cooccurrences in a 
bipartite graph data representation 

The bipartite graph data representation. Let 



V = {x\ 



,.N 



} be a set of observations over n binary 



It is efficient to represent each observation in 2? as a list 
of the variables that are on in the respective observation. 
Thus, data point x % ,i = 1, ... N will be represented by 
the list xlist 1 = list{i> G V\x l v = 1}. The name of 
this representation comes from its depiction in figure 1 
as a bipartite graph (V, T>, £) where each edge (vi) G £ 
corresponds to variable v being on in observation i. The 
space required by this representation is no more than 

sN « nN, 

and much smaller than the space required by the binary 
vector representation of the same data. 

Computing cooccurrences in the bipartite 
graph representation First, let us note that the to- 
tal number Nq of cooccurrences in the dataset T> is 



N c 



vyu 



iV„„ < -s 2 N 



(22) 



where s is the previously defined sparsity of the data. 
Indeed, each data point x contains at most s variables 
that are on, therefore contributing at most s(s — l)/2 
cooccurrences to the sum in (22). 

As it will be shown shortly, computing all the cooc- 
currence counts takes the same amount of time, up to 
a logarithmic factor. The following algorithm not only 
computes all the cooccurrence numbers N uv for u, V € V 
but also constructs a representation of the lists Vq(u). 
Because the data are sparse, we expect the lists Vq{u) 
to contain on average many more elements than their 
respective "complements" 



V (u) = {v e V, v y u, N uv > 0} 



(23) 



variables whose probability of being on is less than 1/2. 



Therefore, instead of representing Vq(u) explicitly, we 
construct Vq (u) representing the list of all the variables 
that follow u and cooccur with u at least once, sorted 
by N v . We assume that all the variables are already 
sorted by decreasing N v , with ties broken arbitrarily and 
form a list L. Hence, to traverse the virtual list Vq(u) 
in decreasing order of I uv , it is sufficient to traverse the 
list L startin at the successor of u, skipping the variables 
contained in Vq(u). 

For the variables that cooccur with u, a set of cooccur- 
rence lists C(u) is created. They that contain the same 
variables as Vq(u) (i.e the variables that follow u and 
cooccur with it) but sorted by the mutual informations 
I uv rather then by N v . 

For each variable v we shall create a temporary stor- 
age of cooccurrences denoted by cheap v organized as a 
Fibonacci heap (or F-heap) [4]. Then, the cooccurrence 
counts N uv and the lists C(u),Vo(u), u G V can be 
computed by the following algorithm: 

Algorithm ListByCooccurrence 

Input list of variables sorted by decreasing N u 
dataset V = {xlist 1 , i — 1, . . . N} 



l.for v — 1, . . . n 

initialize cheap v 
2.for i = 1,...N 
for u € xlist 1 

for v £ xlist 1 , v y u 
insert u into cheap v 
3. for v = 1, . . . n / / empty cheap v and construct the lists 
c = 0, u = v 
while cheap v not empty 

Unew — extract max cheap v 
if {u new >- u) and (w >- i>) 
insert t> in Vo(w) at the end 
insert (v, c) in C(u) 

U Unew , C 1 

else 

C+ + 

if (u >- v) J) last insertion 

insert v in Vo(w) at the end 

insert (v,c) in C(u) 
4. for u = 1, . . . n 
for (v, c) € C(u) 

compute I uv and store it in C(u) together with (v, c) 
sort C(u) by decreasing J M „ 

Output lists Vo(u), C(u) u = 1, . . . n 

The algorithm works as follows: cheap v contains one 
entry for each cooccurrence of u and v. Because we ex- 
tract the elements in sorted order, all cooccurrences of 
v with u will come in a sequence. We save u and keep 
increasing the count c until a new value for u (denoted 
above by u new ) comes out of the F-heap. Then it is 
time to store the count c and v in it's lists. Since the 
v's are examined in decreasing order, we know that the 
new elements in the Vq(u) lists have to be inserted at 
the end. The position of the insertion in C(u) is not im- 
portant, since C(u) will subsequently be sorted, but it is 
important to store the number of cooccurrences c = N uv 
that will enables us to compute the mutual information 
I uv in step 4 of the algorithm. The computed mutual 
informations are also stored. 

Running time. As shown above, an insertion at the 
extremity of a list takes constant time. Extracting the 
maximum of an F-heap takes logarithmic time in the size 
of the heap. Thus, extracting all the elements of a heap 
cheap v of size l u takes 



^ log fc = log l u \<l u log l u . 



fc=i 



Bounding the time to empty the heaps cheap v . Ex- 
tracting all elements of all heaps cheap v takes therefore 
less than r = ^2 U l u log l u . Knowing already that 



]£*„ = N c < l/2s 2 N 



(24) 



r ~ 0(s 2 Nlog — —). The total number of list insertions 
is at most proportional to r. It remains to compute 
the time needed to create cheap v . But we know that 
insertion in an F-heap takes constant time and there are 
Nc cooccurrences to insert. 

In the last step, we have to sort the lists C(u). Sort- 
ing a list of length I' takes I' log I' time. The sum of all 
list lengths is no larger than the total number of cooc- 
currences Nc and by a reasoning similar to previous one 
we conclude that the running time of this step is also 
0(s 2 Nlog — —). Therefore the whole algorithm runs in 
time of the order 



0(s 2 Anog 



(25) 



it is easy to prove that the maximum of r is attained for 
all l u equal. After performing the calculations we obtain 



Memory requirements. The memory requirements 
for the temporary heaps cheap v are equal to Nc- The 
space required by the final lists Vo(u),C(u) is itself no 
more than proportional to the total number of cooccur- 
rences Nc, namely 0(s 2 N). Thus the total space re- 
quired for this algorithm is 0(s 2 N). 

4.3 Putting it all together: the acCL algorithm 
and its data structures 

So far, we have efficient methods for computing the cooc- 
currences and for partially sorting the mutual informa- 
tions of the variables in Vq(u) for all u £ V. What we 
aim for is to create a mechanism that will output the 
edges uv in the decreasing order of their mutual infor- 
mation. If this is achieved, then the Kruskal algorithm 
can be used to construct the optimal tree. 

We shall set up this mechanism in the form of a Fi- 
bonacci heap (F-heap) [4] called vheap that contains an 
element for each u £ V, represented by the edge with the 
highest mutual information among the edges uv, v y u. 
The maximum over this set will obviously be the max- 
imum over the mutual informations of all the possible 
edges not yet eliminated. The record in vheap is of the 
form (c, u, v, I uv ), with v >- u and I uv being the key used 
for sorting. Once the maximum is extracted, the used 
edge has to be replaced by the next largest (in terms of 
I uv ) edge in u's lists. 

To perform this latter task it easy now: for each u we 
have the list of variables cooccurring with u, already 
sorted by mutual information. For the variables not 
cooccurring with u we have the virtual list Vq(u) sorted 
by mutual information as well, but for which the mutual 
information values are not yet computed. All we have to 
do is to compute I UVg with Vo being the head of Vo(u). 
By comparing I UVo with the mutual information of the 
head of C(u) we find 

max Iuv 

vyu 

This value, together with its corresponding u,v,N uv is 
the value to be inserted in the F-heap. Every time an 
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- ■■■■list of v y u, N uv > 0, sorted by /„„ 

' nnnnnnnlist of v y u, N uv = 0, sorted by N v 



+F-heap of size n 



Figure 2: The acCL algorithm: the data structure that supplies the next weightiest candidate edge. Vertically to the 

left are the variables, sorted by decreasing N u . For a given u, there are two lists: C(«), the list of variables v y u, 

sorted in decreasing order of I uv and (the virtual list) Vq(u) sorted by decreasing N v . The maximum of the two first 

elements of these lists is max Iuv that is inserted into an F-heap.The overal maximum of I uv can then be extracted 

vyu 
as the maximum of the F-heap. 



edge (u, v,N UVl I uv ) is inserted in vheap, v is deleted 
from it's coresponding list. The data structures involved 
in this process are schematically shown in figure 4.3. 

With this mechanism in place the Kruskal algorithm 
[3] can be used to construct the desired spanning tree. 
The outline of the algorithm is 

Algorithm acCL 



Inputvariable set V of size n 

dataset T> = {xlist 1 , i = 1, . . . N} 
1. compute N v for v £ V 

create vlist, list of variables in V sorted by decreasin 
2.ListByCooccurrence 
3. create vheap 
for u € vlist 

v = argmax I uv 

hcadC u ,hcadVo(^) 

insert (c,u,v,I uv ) in vheap 
4.E = KruskalMST(v/ieap); store the c = N uv values 

for the edges added to E 
5. for uv £ E 

compute the probability table T uv using N U ,N V , 

N uv and N. 
Output T 



4.4 Time and storage requirements 

Running time In the first step we have to compute the 
variables' frequencies N v . This can be done by scanning 
trough the data points and by increasing the correspond- 
ing N v each time v is found in xlist 1 for i — 1, ... N. This 
procedure will take at most sN operations. Adding in 
the time to initialize all N v (0(n)) and the time to sort 
the N v values (O(nlogn)) gives an upper bound on the 
running time for step 1 of the acCL algorithm of 

0(nlogn+ sN) 

The running time of the second step is already estimated 

to 

s 2 /V 

0{s 2 N log ) 

n 

Step 3, constructing a Fibonacci heap of size n, takes 
constant time per insertion, thus 

0{n) 



An additional amount of time may be needed to extract 
elements from the virtual lists Vo(u) by skipping the 
cooccurring elements, but this time will be accounted 
for in step 4. 

Step 4 is the Kruskal algorithm. Each extraction from 
vheap is O(logn). All the extractions from the virtu- 
ally represented Vq(u) take no more than tlk + Nc time 
steps since there are at most Nc elements that have to be 
skipped. An extraction from C(u) takes constant time. 
riK is the number of steps taken by Kruskal's algorithm. 
After choosing a candidate edge, the Kruskal algorithm 
checks if it forms a cycle with edges already included in 
the tree, and if not, it includes it too. By using an effi- 
cient disjoint set representation [3] the former operation 
can be performed in constant time; the second opera- 
tion, equivalent to insertion in a list, can be performed 
in constant time as well. Therefore, the running time for 
this step is 

0(n K logn + N c ) 

The last step computes n—1 probability tables, each of 
them taking constant time. Thus, its running time is 



0(n) 

Adding up these five terms we obtain the upper bound 
for the running time of the acCL algorithm as 



0{n log n + sn + s 2 N log 



s z N 



n K log n) (26) 



Ignoring the logarithmic factors the bound becomes 

6{sn+s 2 N + n K ) (27) 

For constant or bounded s, the above bound is a poly- 
nomial of degree 1 in the three variables n, N and rix- 
However, we know that uk the total number of edges 
inspected by Kruskal's algorithm has the range 



1 < n K < 



n(n — 1) 



(28) 



Hence, in the worst case the above algorithm is quadratic 
in n. However, there are reasons to believe that in prac- 
tice the dependence of nx on n is subquadratic. Ran- 
dom graph theory suggests that if the distribution of the 
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Figure 3: The mean (full line), standard deviation and 
maximum (dotted line) of Kruskal algorithm steps nx 
over 1000 runs plotted against nlogn. n ranges from 5 
to 3000. The edge weights were sampled from a uniform 
distribution. 



weight values is the same for all edges, then Kruskal 's 
algorithm should take a number of steps proportional to 
nlogn [12]. This result is sustained by experiments we 
have conducted: we ran Kruskal algorithm on sets of ran- 
dom weights over domains of dimension up to n = 3000. 
For each n, 1000 runs were performed. Figure 3 shows 
the average and maximum nx plotted versus nlogn. 
The curves display a close to linear dependence. 

Memory requirements To store data and results we 
need: O(sN) for the dataset in the bipartite graph repre- 
sentation, 0(n) to store the variables and another 0(n) 
to store the resulting tree structure and parametrization. 

The additional storage required by the algorithm in- 
cludes 0(n) for v — list, then 0(s 2 N) for all the lists 
created in step 2 of the algorithm. In step 3, vheap is 
created taking up 0(n) memory. The Kruskal algorithm 
requires 0(ri) storage to keep track of the tree topology. 
Computing the probability tables in step 5 requires no 
additional storage besides that taken up by the distribu- 
tion itself. 

Hence, the total space used by the algorithm is 



0(s 2 N + n). 



(29) 



5 Generalization to discrete variables of 
arbitrary arity 

This section will show how the Chow and Liu algorithm 
can be accelerated in discrete domains where the vari- 
ables can take more than two values. We shall assume 
that a variable v £ V takes r v values, 2 < r v < Tmax- 
The algorithm that we are going to develop is a simple 



extension of the acCL algorithm. Therefore, we shall 
present only the ideas that lead to extension and the 
modifications to the acCL algorithm that they imply. 

The acCL algorithms introduced previously were ex- 
ploiting the data sparsity. If they are to be generalized, 
it is first necessary to extend the notion of sparsity itself. 
Thus, in the forthcoming we shall assume that for each 
variable exists a special value that appears with higher 
frequency than all the other values. This value will be 
denoted by 0, without loss of generality. For example, in 
a medical domain, the value for a variable would rep- 
resent the "normal" value, whereas the abnormal values 
of each variable would be designated by non-zero values. 
Similarly, in a diagnostic system, will indicate a normal 
or correct value, whereas the non-zero values would be 
assigned to the different failure modes associated with 
the respective variable. 

An occurence for variable v will be the event d^O and 
a cooccurrence of u and v means that u and v are both 
non-zero in the same data point. Therefore, we define 
|x| as the number of non-zero values in observation x 



\x\ =n — 



vev 



S Xv 



(30) 



The sparsity s will be the maximum of \x\ over the data 
set, as before. 

From the above it can be anticipated that the high fre- 
quency of the values will help accelerate the tree learn- 
ing algorithm. As before, we shall represent only the 
occurrences explicitly, creating thereby a compact and 
efficient data structure. Moreover, we shall demonstrate 
presorting mutual informations for non-cooccurring vari- 
ables in this case can be done in the same way as before. 

5.1 Computing cooccurrences 

Following the previously introduced idea of not repre- 
senting values explicitly, each data point x will be re- 
placed by the list xlist of the variables that occur in 
it. However, since there can be more than one non-zero 
value, the list has to store this value along with the vari- 
able index. Thus 

xlist = list{(v, x v ), v £ V,x v y^ 0}. 

Similarly, a cooccurrence will be represented by the 
quadruple (u, x u ,v,x v ), x Ul x v ^ 0. Counting and stor- 
ing cooccurrences can be done in the same time as before 
and with a proportionally larger amount on memory, re- 
quired by the additional need to store the (non-zero) 
variable values. 

Instead of one cooccurrence count N uv we shall now 
have a two-way contingency table A™, . Each N^ v repre- 
sents the number of data points where u = i,v = j, i,j =/= 
0. This contingency table, together with the marginal 
counts Nl (defined as the number of data points where 
v = j, j 7^ 0) and with N completely determine the 



joint distribution of u and v (and consequently the mu- 
tual information I uv ). Constructing the cooccurrence 
contingency tables multiplies the storage requirements 
of this step of the algorithm by 0(r 2 MAX ) but does not 
change the running time. 

5.2 Presorting mutual informations 

As in subsection 4.1, our goal is to presort the mutual 
informations I uv for all v >~ u that do not cooccur with 
u. We shall show that this can be done exactly as be- 
fore. The derivations below will be clearer if they are 
made in terms of probabilities; therefore, we shall use 
the notations: 



pression of the mutual information I uv of two non cooc- 
curring variables u, v in terms of P u q , P v q and H^ only. 



I« 



Hi, — H, 



u\v 



(35) 



The second term, the conditional entropy of u given v is 



H u \v — PvoHu\v=0 + ^_^ 
3+0 




(36) 



'»(») = 


N ' ^ 


(31) 


PvO E 


e P„(o) = 1-J2 p v(*) 

i+O 


(32) 



The above quantities represent the (empirical) probabil- 
ities of v taking value i ^ and respectively. Entropies 
will be denoted by H. 

A "chain rule" expression for the entropy of a 
discrete variable. The entropy H v of any multivalued 
discrete variable v can be decomposed in the following 
way: 

H v = (33) 



The last term in the above equation is because, for 
any non-zero value of v, the condition N uv = implies 
that u has to be 0. Let us now develop H u i v —q using the 
decomposition in equation (34). 

H u \v=a = Huo\v=o + (1 ~~ P u =o\v=o) H u \u^o,v=o (37) 

Because u and v are never non-zero in the same time, 
all non-zero values of u are paired with zero values of v. 
Hence, knowing that v = brings no additional infor- 
mation once we know that a^O. In probabilistic terms: 
Pr[u = i\u ^ 0, v = 0] = Pr[u = i\u ^ 0] and 



H„ 



u\u^0,v— 



H-77 



(38) 



The term H u=0 \ v= q is the entropy of a binary variable 
whose probability is Pr[u — 0\v = 0]. This probability 
equals 



Pr[u = Q\v = 0] 



— PvV 


log 


PvO — 


i#0 


Pv( 


:)iogP v 


:*) 


— PvV 


log 


PvO — 


(1- 


*v 


i#0 V 


P v (i) 

— PvO) 


+ ioi 


;(i- 


- Pvo) 










— PvC 


log 
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(1- 


*v 


o)log(l 


- Pvo) 
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"(1- 


-p. 


z/0 


P. 
(1- 


Pv 


o) ^ (1 


P v (i) 

— PvO) 



i - y~*,p u \v=o(i 

i^0 

P u {i) 



log 



Py(j) 
(1 - PvO) 



- i-E 



i 



, PvO 

1^0 

1 ~ PuO 

1 — PvO 



(39) 



-H- 



— Hvo + (1 — Pvo)H- 



Note that in order to obtain a non-negative probability 
in the above equation one needs 

1 — PuO < PvO 

a condition that is always satisfied if u and v do not 
cooccur. 
/'34~) Replacing the previous three equations in the formula 
of the mutual information, we get 



This decomposition represents a sampling model 
where first we choose whether v will be zero or not, 
and then, if the outcome is "non-zero" we choose one of 
the remaining values by sampling from the distribution 
P v \v^o(i) = i-p ■ HvO is the uncertainty associated 
with the first choice, whereas Hy = H v i V7 tQ is the en- 
tropy of the outcome of the second one. The advantage 
of this decomposition for our purpose is that it separates 
the outcome from the others and "encapsulates" the 
uncertainty of the latters in the number H-. 

The mutual information of two non-cooccurring 
variables We shall use the above fact to find an ex- 



= P u0 log P u0 - P v0 log P v0 
+ (P u o + PvO-l)log(P u0 



(40) 



PvO - 1) 



an expression that, remarkably, depends only on P u q and 
P v q. Taking its partial derivative with respect to P v q 
yields 

.p- = log p < (41) 

O-TvO "vO 

a value that is always negative, independently of P^o- 
This shows the mutual information increases monoton- 
ically with the "occurrence frequency" of v given by 
1 — P v q. Note also that the above expression for the 



derivative is a rewrite of the result obtained for binary 
variables in (19). 

We have shown that the acCL algorithm can be ex- 
tended to variables taking more than two values by mak- 
ing only one (minor) modification: the replacement of 
the scalar counts N v and N uv by the vectors N£, j ^ 
and, respectively, the contingency tables N^ v , i,j ^ 0. 

6 Using the acCL algorithm with EM 

So far it has been shown how to accelerate the CL al- 
gorithm under the assumption that the target probabil- 
ity distribution P is defined in terms of integer counts 3 
N, N vt N uv , u,v <G V. This is true when fitting one tree 
distribution to an observed data set or in the case of clas- 
sification with TANB models where the data points are 
partitioned according to the observed class variable. But 
an important application of the CL algorithm are mix- 
tures of trees [8] , and in the case of learning mixtures by 
the EM algorithm the counts defining P for each of the 
component trees are not integer. 

Each Expectation step of the EM algorithm computes 
the posterior probability of each mixture component k of 
having generated data point x % . This values is denoted 
by 7fe(i). The values 7 have the effect of "weighting" the 
points in the dataset T> with values in [0,1], different for 
each of the trees in the mixture. The counts N k and 
N k v corresponding to tree T k are defined in terms of the 

7 values as 

n 



N k -- 






(42) 


N k = 


= E 7fe ^ 

2:2;* __1 


) 


(43) 




= E 


lk{i)- 


(44) 



i'.x* =lAx* =1 



These counts are in general not integer numbers. Let 
us examine steps 1 and 2 of the acCL algorithm and 
modify them in order to handle weighted data. 

First, remark that if the mixture has m components, 
step 1 will have to sort the variables m times, producing 
m different vlists, one for each of the components. Com- 
puting the JV^ values is done similarly to the previous 
section; the only modification is that for each occurrence 
of v in a data point one adds -fk (i) to N k instead of incre- 
menting a counter. Remark that no operations are done 
for pairs of variables that do not cooccur in the original 
data set, preserving thereby the algorithm's guarantee 
of efficiency. 

For step 2, a similar approach is taken. At the time 
of inserting in cheap k one must store not only v but also 
7fe(i). When the heap is emptied, the current "count" c 
sums all the jk(i) values corresponding to the given v. 



Note also that one can use the fact that the data are 
sparse to accelerate the E step as well. One can precom- 
pute the "most frequent" likelihood T k = T fc (0,...,0) 
for each k. Then, if v is 1 for point x 1 one multiplies 



Tq by the ratio 



SjpaW 



(l|pa(«)) 



also precomputed. This 



W)(°IP a K» 

way the E step will run in 0(msN) time instead of the 
previously computed 0(mnN). 

7 Factorized priors and the acCL 
algorithm 

All of the above assumes that the tree or mixture is to 
be fit to the data in the maximum likelihood framework. 
This section will study the possibility of using priors in 
conjunction with the acCL algorithm. The classes of 
priors that we shall be concerned with are the decom- 
posable priors that preserve the parameter independence 
assumptions on which the CL algorithm is based. For an 
in-depth discussion of decomposable priors the reader is 
invited to consult [7]. We shall first examine priors on 
the tree's structure having the form 



P(E) oc exp I - Y, & 



uvGE 



This prior translates into a penalty on the weight of edge 
uv as seen by a MWST algorithm 



W„ 



I, 



Puv 



3 In this and the subsequent sections we return to the bi- 
nary variable assumption for the sake of notational simplicity. 



It is easily seen that for general (3 UV values such a mod- 
ification cannot be handled by the acCL algorithm. In- 
deed, this would affect the ordering of the edges out- 
going from u in a way that is inpredictable from the 
counts N v , N uv . However, if (3 UV is constant for all pairs 
u,v <G V, then the ordering of the edges is not affected. 
All we need to do to use an acCL algorithm with a con- 
stant penalty is to compare the I uv of each edge, at 
the moment it is extracted from vheap by Kruskal's al- 
gorithm, with the quantity -j|. The algorithm stops as 
soon as one edge is found whose mutual information is 
smaller than the penalty -^ and proceeds as before oth- 
erwise. Of course, in the context of the EM algorithm, 
N is replaced by N k and (3 can be different for each com- 
ponent of the mixture. Remark that if all the variables 
are binary (or have the same number of values) an MDL 
type edge penalty translates into a constant (3 UV . 

Regarding Dirichlet priors on the tree's parameters, 
it is known [7] that they can be represented as a set of 
fictitious counts N' uv , u, v £ V and that maximizing the 
posterior probability of the tree is equivalent to minimiz- 
ing the KL divergence 

KL(P\\T) 

with P a mixture between the empirical distribution P 



and the fictitious distribution P' defined by N' uv . 



P = 



N 



:P 



N' 



N + N 1 N + N' 



-.P' 



(45) 



In this situation, the basic sparsity assumption that the 
acCL algorithm relies on may be challenged. Recall that 
it is important that most of the N uv values are 0. If 
the counts N uv violate this assumption then the acCL 
algorithms become inefficient. In particular, the acCL 
algorithm degrades to a standard CL algorithm. Having 
many or all N' uv > is not a rare case. In particular, it 
is a characteristic of the non-informative priors that aim 
at smoothing the model parameters. This means that 
smoothing priors and the acCL algorithm will in general 
not be compatible. 

Somehow suprisingly, however, the uniform prior given 

by 

Kv = — N' (46) 

'U'V 

constitutes an exception: for this prior, in the case of bi- 
nary variables, all the fictitious cooccurrence counts are 
equal to N' /A. Using this fact, one can prove that for 
very small and for large values of N' (> 8) the order 
of the mutual informations in Vq(u) is preserved and re- 
spectively reversed. This fact allows us to run the acCL 
algorithm efficiently after only slight modification. 

8 Experiments 

The following experiments compare the (hypothesized) 
gain in speed of the acCL algorithm w.r.t the traditional 
Chow and Liu method under controlled conditions on 
artificial data. 

The binary domain has a dimensionality n varying 
from 50 to 1000. Each data point has a fixed number 
s of variables being on. The sparsity s takes the val- 
ues 5, 10, 15 and 100. The small values were chosen 
to gauge the advantage of the accelerated algorithm un- 
der extremely favorable conditions. The larger value will 
help us see how the performance degrades under more 
realistic circumstances. Each data point (representing 
a list of variables being on) was generated as follows: 
the first variable was picked randomly from the range 
1, . . . n; the subsequent points were sampled from a ran- 
dom walk with a random step size between -4 and 4. For 
each pair n, s a set of 10,000 points was generated. 

Each data set was used by both CL and acCL to fit one 
tree distribution and the running times were recorded 
and plotted in figure 4. The improvements over the tra- 
ditional version for sparse data are spectacular: learn- 
ing a tree over 1000 variables from 10,000 data points 
takes 4 hours by the traditional algorithm and only 4 
seconds by the accelerated version when the data are 
sparse (s = 15). For s = 100 the acCL algorithm takes 
2 minutes to complete, improving on the traditional al- 
gorithm by a factor of "only" 123. 
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What is also noticeable is that the running time of 
the accelerated algorithm seems to be almost indepen- 
dent of the dimension of the domain. On the other side, 
the number of steps Uk (figure 5) grows with n. This 
observation implies that the bulk of the computation lies 
with the steps preceding the Kruskal algorithm proper. 
Namely, that it is in computing cooccurrences and or- 
ganizing the data that most of the time is spent. This 
observation would deserve further investigation for large 
real- world applications. 

Figure 4 also confirms that the running time of the 
traditional CL algorithm grows quadratically with n and 
is independent of s. 

9 Concluding remarks 

This paper has presented a way of taking advantages 
of sparsity in the data to accelerate the tree learning 
algorithm. 

The method achieves its performance by exploiting 
characteristics of the data (sparsity) and of the prob- 
lem (the weights represent mutual informations) that 
are external to the Maximum Weight Spanning Tree al- 
gorithm proper. Moreover, it has been shown empiri- 
cally that a very significant part of the algorithms' run- 
ning time is spent in computing cooccurrences. This 
prompts future work on applying trees and mixtures of 
trees to high-dimensional tasks to focus on methods for 
structuring the data and for computing or approximat- 
ing marginal distributions and mutual informations in 
specific domains. 
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